We will treat all vectors as column vectors by default.
Let $A$ be $m \times n$ and $B$ be $n \times p$, and let the product $AB$ be

$$ C = AB $$

Then $C$ is an $m \times p$ matrix, with element $(i, j)$ given by

$$ c_{ij} = \sum_{k=1}^n a_{ik}b_{kj} $$

Let $A$ be $m \times n$ and $x$ be $n \times 1$; then the typical element of the product

$$ z = Ax $$

is given by

$$ z_i = \sum_{k=1}^n a_{ik}x_k $$
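A minimal NumPy sketch (the shapes and values are arbitrary, chosen only for illustration) that checks these element-wise formulas against the built-in matrix product:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, p = 3, 4, 2
A = rng.standard_normal((m, n))
B = rng.standard_normal((n, p))
x = rng.standard_normal(n)

# c_ij = sum_k a_ik * b_kj, written as explicit loops over i, j, k
C = np.array([[sum(A[i, k] * B[k, j] for k in range(n)) for j in range(p)]
              for i in range(m)])
assert np.allclose(C, A @ B)

# z_i = sum_k a_ik * x_k
z = np.array([sum(A[i, k] * x[k] for k in range(n)) for i in range(m)])
assert np.allclose(z, A @ x)
```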
Finally, just to remind:

Gradient. Let $f(x):\mathbb{R}^n \to \mathbb{R}$; then the vector containing all first-order partial derivatives is
$$ \nabla f(x) = \dfrac{df}{dx} = \begin{pmatrix} \frac{\partial f}{\partial x_1} \\ \frac{\partial f}{\partial x_2} \\ \vdots \\ \frac{\partial f}{\partial x_n} \end{pmatrix} $$

Hessian. Let $f(x):\mathbb{R}^n \to \mathbb{R}$; then the matrix containing all second-order partial derivatives is
$$ f''(x) = \dfrac{\partial^2 f}{\partial x_i \partial x_j} = \begin{pmatrix} \frac{\partial^2 f}{\partial x_1 \partial x_1} & \frac{\partial^2 f}{\partial x_1 \partial x_2} & \dots & \frac{\partial^2 f}{\partial x_1\partial x_n} \\ \frac{\partial^2 f}{\partial x_2 \partial x_1} & \frac{\partial^2 f}{\partial x_2 \partial x_2} & \dots & \frac{\partial^2 f}{\partial x_2 \partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial^2 f}{\partial x_n \partial x_1} & \frac{\partial^2 f}{\partial x_n \partial x_2} & \dots & \frac{\partial^2 f}{\partial x_n \partial x_n} \end{pmatrix} $$

In fact, the Hessian can also be a tensor: for a vector-valued function $f(x): \mathbb{R}^n \to \mathbb{R}^m$ it is a 3D tensor whose slices are the Hessians of the corresponding scalar components, $\left( H\left(f_1(x)\right), H\left(f_2(x)\right), \ldots, H\left(f_m(x)\right)\right)$.
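As an illustration of these definitions (the function below is just an example), the gradient and Hessian can be approximated by central finite differences; the resulting Hessian comes out symmetric, as expected for a smooth function:

```python
import numpy as np

def f(x):
    # example scalar function f: R^2 -> R (chosen only for illustration)
    return np.sin(x[0]) + x[0] * x[1] ** 2

def grad_fd(f, x, h=1e-6):
    # central-difference approximation of the gradient
    n = x.size
    g = np.zeros(n)
    for i in range(n):
        e = np.zeros(n); e[i] = h
        g[i] = (f(x + e) - f(x - e)) / (2 * h)
    return g

def hess_fd(f, x, h=1e-4):
    # finite-difference Hessian, built column by column from the gradient
    n = x.size
    H = np.zeros((n, n))
    for j in range(n):
        e = np.zeros(n); e[j] = h
        H[:, j] = (grad_fd(f, x + e) - grad_fd(f, x - e)) / (2 * h)
    return H

x0 = np.array([0.5, -1.0])
g = grad_fd(f, x0)                     # ≈ (cos(x1) + x2^2, 2 x1 x2)
H = hess_fd(f, x0)                     # ≈ [[-sin(x1), 2 x2], [2 x2, 2 x1]]
assert np.allclose(H, H.T, atol=1e-4)  # the Hessian is symmetric
```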
Jacobian. The extension of the gradient to a multidimensional function $f(x):\mathbb{R}^n \to \mathbb{R}^m$ is
$$ f'(x) = \dfrac{df}{dx^T} = \begin{pmatrix} \frac{\partial f_1}{\partial x_1} & \frac{\partial f_1}{\partial x_2} & \dots & \frac{\partial f_1}{\partial x_n} \\ \frac{\partial f_2}{\partial x_1} & \frac{\partial f_2}{\partial x_2} & \dots & \frac{\partial f_2}{\partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial f_m}{\partial x_1} & \frac{\partial f_m}{\partial x_2} & \dots & \frac{\partial f_m}{\partial x_n} \end{pmatrix} $$

The table below summarizes where the derivative $G$ lives for an input space $X$ and output space $Y$:

X | Y | G | Name |
---|---|---|---|
$\mathbb{R}$ | $\mathbb{R}$ | $\mathbb{R}$ | $f'(x)$ (derivative) |
$\mathbb{R}^n$ | $\mathbb{R}$ | $\mathbb{R}^n$ | $\dfrac{\partial f}{\partial x_i}$ (gradient) |
$\mathbb{R}^n$ | $\mathbb{R}^m$ | $\mathbb{R}^{n \times m}$ | $\dfrac{\partial f_i}{\partial x_j}$ (jacobian) |
$\mathbb{R}^{m \times n}$ | $\mathbb{R}$ | $\mathbb{R}^{m \times n}$ | $\dfrac{\partial f}{\partial x_{ij}}$ |
The vector $\nabla f(x)$ is called the gradient of $f(x)$. It indicates the direction of steepest ascent; thus, the vector $-\nabla f(x)$ gives the direction of steepest descent of the function at the point. Moreover, the gradient vector is always orthogonal to the contour line at the point.
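A quick numerical illustration of the steepest-ascent claim (the function below and the sampling of directions are arbitrary choices): among unit directions, the normalized gradient gives the largest directional derivative.

```python
import numpy as np

f = lambda x: np.exp(x[0]) + x[0] * x[1] ** 2         # example function R^2 -> R
grad = lambda x: np.array([np.exp(x[0]) + x[1] ** 2,  # its gradient, by hand
                           2 * x[0] * x[1]])

x0 = np.array([0.3, -0.8])
g = grad(x0)
h = 1e-6
dir_deriv = lambda d: (f(x0 + h * d) - f(x0 - h * d)) / (2 * h)

# the directional derivative along the normalized gradient ...
best = dir_deriv(g / np.linalg.norm(g))
# ... is not exceeded by any sampled unit direction
rng = np.random.default_rng(1)
for _ in range(1000):
    d = rng.standard_normal(2)
    d /= np.linalg.norm(d)
    assert dir_deriv(d) <= best + 1e-8
```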
The basic idea of the naive approach is to reduce matrix/vector derivatives to the well-known scalar derivatives. One of the most important practical tricks here is to keep the summation index ($i$) separate from the index of the partial derivative ($k$). Ignoring this simple rule tends to produce mistakes.
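For instance, for the linear function $f(x) = b^Tx = \sum_{i=1}^n b_i x_i$, keeping the summation index $i$ distinct from the differentiation index $k$ gives

$$ \dfrac{\partial f}{\partial x_k} = \sum_{i=1}^n b_i \dfrac{\partial x_i}{\partial x_k} = b_k \quad \Rightarrow \quad \nabla f(x) = b. $$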
The guru approach implies formulating a set of simple rules which allow you to calculate derivatives just as in the scalar case. It might be convenient to use the differential notation here.
After obtaining the differential $df$ in this notation, we can retrieve the gradient using the following formula:
$$ df(x) = \langle \nabla f(x), dx\rangle $$

Then, if we have a differential of the above form and we need to calculate the second derivative of the matrix/vector function, we treat the "old" $dx$ as a constant $dx_1$ and then calculate $d(df)$:
$$ d^2f(x) = \langle \nabla^2 f(x) dx_1, dx_2\rangle = \langle H_f(x) dx_1, dx_2\rangle $$
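For example, for $f(x) = x^Tx$:

$$ df = (dx)^Tx + x^Tdx = 2x^Tdx = \langle 2x, dx\rangle \quad \Rightarrow \quad \nabla f(x) = 2x, $$

$$ d^2f = d\left(2x^Tdx_1\right) = 2\,dx_2^Tdx_1 = \langle 2I\,dx_1, dx_2\rangle \quad \Rightarrow \quad f''(x) = 2I. $$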
In the exercises below, let $A$ and $B$ be constant matrices, while $X$ and $Y$ are variables (or matrix functions).

Find $\nabla f(x)$, if $f(x) = \dfrac{1}{2}x^TAx + b^Tx + c$.
Find the gradient $\nabla f(x)$ and hessian $f''(x)$, if $f(x) = \dfrac{1}{2} \|Ax - b\|^2_2$.
Find $\nabla f(x), f''(x)$, if $f(x) = -e^{-x^Tx}$.
Find the gradient $\nabla f(x)$ and hessian $f''(x)$, if
$$ f(x) = \ln \left( 1 + \exp\langle a,x\rangle\right) $$

Find $f'(X)$, if $f(X) = \det X$.
Note: here $f'(X)$ denotes the matrix appearing in the first-order Taylor approximation of $f(X)$:

$$ f(X + \Delta X) \approx f(X) + \mathbf{tr}(f'(X)^\top \Delta X) $$
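A numerical sanity check of this convention on a simpler function, $f(X) = \mathbf{tr}(X^2)$, whose derivative in this sense is $f'(X) = 2X^\top$ (the function, sizes, and perturbation scale are chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((4, 4))
dX = 1e-6 * rng.standard_normal((4, 4))  # small perturbation

f = lambda X: np.trace(X @ X)            # f(X) = tr(X^2)
G = 2 * X.T                              # claimed f'(X) for this f

lhs = f(X + dX) - f(X)                   # actual change of f
rhs = np.trace(G.T @ dX)                 # tr(f'(X)^T dX)
assert abs(lhs - rhs) < 1e-8             # they agree to first order
```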
Calculate: $\dfrac{\partial }{\partial X} \sum \text{eig}(X), \;\;\dfrac{\partial }{\partial X} \prod \text{eig}(X), \;\;\dfrac{\partial }{\partial X}\text{tr}(X), \;\; \dfrac{\partial }{\partial X} \text{det}(X)$

Find $\nabla f(X)$, if $f(X) = \langle S, X\rangle - \log \det X$.
Calculate the derivatives of the loss function with respect to the parameters, $\frac{\partial L}{\partial W}$ and $\frac{\partial L}{\partial b}$, for a single object $x_i$ (i.e., $n = 1$).